The data exploration part is based on code from https://www.kaggle.com/code/muhammadfaizan65/machine-failure-prediction-eda-modeling, with new models and comparisons added. The data can be found at https://www.kaggle.com/datasets/umerrtx/machine-failure-prediction-using-sensor-data?resource=download.¶
Dataset Overview¶
This dataset contains sensor data collected from various machines, intended for predicting machine failures in advance. It includes a variety of sensor readings as well as recorded machine failures.
Columns Description¶
footfall: The number of people or objects passing by the machine.
tempMode: The temperature mode or setting of the machine.
AQ: Air quality index near the machine.
USS: Ultrasonic sensor data, indicating proximity measurements.
CS: Current sensor readings, indicating the electrical current usage of the machine.
VOC: Volatile organic compounds level detected near the machine.
RP: Rotational position or RPM (revolutions per minute) of the machine parts.
IP: Input pressure to the machine.
Temperature: The operating temperature of the machine.
fail: Binary indicator of machine failure (1 for failure, 0 for no failure).
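As a quick sanity check of this schema after loading, the expected column names and the binary target can be verified directly. A minimal sketch — the two-row frame below is synthetic stand-in data, not the real CSV:

```python
import pandas as pd

# Expected schema from the column descriptions above
EXPECTED_COLUMNS = ["footfall", "tempMode", "AQ", "USS", "CS",
                    "VOC", "RP", "IP", "Temperature", "fail"]

# Synthetic stand-in for data.csv, used here only to illustrate the check
data = pd.DataFrame([[0, 7, 7, 1, 6, 6, 36, 3, 1, 1],
                     [190, 1, 3, 3, 5, 1, 20, 4, 1, 0]],
                    columns=EXPECTED_COLUMNS)

assert list(data.columns) == EXPECTED_COLUMNS   # schema matches the description
assert data["fail"].isin([0, 1]).all()          # target is strictly binary
```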
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score
# Deep learning and gradient boosting libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import xgboost as xgb
# Load the dataset
file_path = "data.csv"
data = pd.read_csv(file_path)
# Display basic info and summary
print(data.info())
print(data.describe())
print(data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 footfall 944 non-null int64
1 tempMode 944 non-null int64
2 AQ 944 non-null int64
3 USS 944 non-null int64
4 CS 944 non-null int64
5 VOC 944 non-null int64
6 RP 944 non-null int64
7 IP 944 non-null int64
8 Temperature 944 non-null int64
9 fail 944 non-null int64
dtypes: int64(10)
memory usage: 73.9 KB
None
footfall tempMode AQ USS CS \
count 944.000000 944.000000 944.000000 944.000000 944.000000
mean 306.381356 3.727754 4.325212 2.939619 5.394068
std 1082.606745 2.677235 1.438436 1.383725 1.269349
min 0.000000 0.000000 1.000000 1.000000 1.000000
25% 1.000000 1.000000 3.000000 2.000000 5.000000
50% 22.000000 3.000000 4.000000 3.000000 6.000000
75% 110.000000 7.000000 6.000000 4.000000 6.000000
max 7300.000000 7.000000 7.000000 7.000000 7.000000
VOC RP IP Temperature fail
count 944.000000 944.000000 944.000000 944.000000 944.000000
mean 2.842161 47.043432 4.565678 16.331568 0.416314
std 2.273337 16.423130 1.599287 5.974781 0.493208
min 0.000000 19.000000 1.000000 1.000000 0.000000
25% 1.000000 34.000000 3.000000 14.000000 0.000000
50% 2.000000 44.000000 4.000000 17.000000 0.000000
75% 5.000000 58.000000 6.000000 21.000000 1.000000
max 6.000000 91.000000 7.000000 24.000000 1.000000
(944, 10)
# Check for missing values
print(data.isnull().sum())
footfall       0
tempMode       0
AQ             0
USS            0
CS             0
VOC            0
RP             0
IP             0
Temperature    0
fail           0
dtype: int64
# Distribution of numeric columns
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns)
for i, column in enumerate(data.columns):
    row = i // 2 + 1
    col = i % 2 + 1
    hist = px.histogram(data, x=column, template='plotly_dark', color_discrete_sequence=['#F63366'])
    hist.update_traces(marker_line_width=0.5, marker_line_color="white")
    fig.add_trace(hist.data[0], row=row, col=col)
fig.update_layout(height=1200, title_text="Distribution of Numeric Columns", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
# Correlation Heatmap
corr = data.corr()
fig = ff.create_annotated_heatmap(
    z=corr.values,
    x=list(corr.columns),
    y=list(corr.index),
    annotation_text=corr.round(2).values,
    showscale=True,
    colorscale='Viridis')
fig.update_layout(title_text='Correlation Heatmap', title_font=dict(size=25), title_x=0.5)
fig.show()
# Boxplots for each feature to identify outliers
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns[:-1])
for i, column in enumerate(data.columns[:-1]):  # Excluding the target column 'fail'
    row = i // 2 + 1
    col = i % 2 + 1
    box = px.box(data, y=column, template='plotly_dark', color_discrete_sequence=['#636EFA'])
    box.update_traces(marker_line_width=0.5, marker_line_color="white")
    fig.add_trace(box.data[0], row=row, col=col)
fig.update_layout(height=1200, title_text="Boxplots of Features", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
# Scatter plots to visualize relationships between features and target
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns[:-1])
for i, column in enumerate(data.columns[:-1]):  # Excluding the target column 'fail'
    row = i // 2 + 1
    col = i % 2 + 1
    scatter = px.scatter(data, x=column, y='fail', template='plotly_dark', color='fail', color_continuous_scale='Viridis')
    scatter.update_traces(marker=dict(size=5, opacity=0.7, line=dict(width=0.5, color='white')))
    fig.add_trace(scatter.data[0], row=row, col=col)
fig.update_layout(height=1200, title_text="Scatter Plots of Features vs Fail", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
# Data Preprocessing
X = data.drop(columns=['fail'])
y = data['fail']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
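Note that the scaler is fit on the training split only; the test split is transformed with the training statistics, so no information leaks from test to train. A minimal numeric sketch of this behavior (synthetic arrays, not the sensor data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(100, 3))   # stand-in "train" features
X_te = rng.normal(loc=10.0, scale=2.0, size=(20, 3))    # stand-in "test" features

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)   # learns mean/std from train only
X_te_s = scaler.transform(X_te)       # reuses the train statistics

# Only the training data is exactly centered and unit-variance after scaling
assert np.allclose(X_tr_s.mean(axis=0), 0.0, atol=1e-9)
assert np.allclose(X_tr_s.std(axis=0), 1.0, atol=1e-9)
```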
# Model Training and Evaluation Function
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    # Train the model
    if model_name == 'Neural Network':
        # Neural Network specific training with early stopping
        model.fit(X_train, y_train,
                  epochs=100,
                  batch_size=32,
                  validation_split=0.2,
                  callbacks=[EarlyStopping(patience=10)],
                  verbose=0)
        # Predict once, then threshold the probabilities at 0.5
        y_prob = model.predict(X_test).flatten()
        y_pred = (y_prob > 0.5).astype(int)
    elif model_name == 'XGBoost':
        # XGBoost specific training with DMatrix; the estimator passed in is
        # replaced by the booster returned from xgb.train
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)
        # XGBoost training parameters ('seed' is the native-API name)
        params = {
            'objective': 'binary:logistic',
            'eval_metric': 'logloss',
            'seed': 42
        }
        # Use watchlist for early stopping
        watchlist = [(dtrain, 'train'), (dtest, 'eval')]
        model = xgb.train(
            params,
            dtrain,
            num_boost_round=100,  # max number of boosting iterations
            evals=watchlist,
            early_stopping_rounds=10,
            verbose_eval=False
        )
        y_prob = model.predict(dtest)
        y_pred = (y_prob > 0.5).astype(int)
    else:
        # Scikit-learn models
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]

    # Evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    cm = confusion_matrix(y_test, y_pred)

    # ROC Curve
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    return {
        'model': model,
        'accuracy': accuracy,
        'report': report,
        'confusion_matrix': cm,
        'fpr': fpr,
        'tpr': tpr,
        'roc_auc': roc_auc
    }
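The metrics collected in the returned dictionary can be sanity-checked by hand on a toy example. The labels and scores below are made up, using the same 0.5 threshold as the function above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # made-up probabilities
y_hat = (y_score > 0.5).astype(int)          # threshold at 0.5, as above

acc = accuracy_score(y_true, y_hat)          # 3 of 4 correct -> 0.75
cm = confusion_matrix(y_true, y_hat)         # [[2, 0], [1, 1]]
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)                      # 0.75 for these scores
```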
# Initialize models
models = {
    'Neural Network': Sequential([
        Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
        Dropout(0.3),
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ]),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42, use_label_encoder=False)
}
# Compile Neural Network
models['Neural Network'].compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Hyperparameter grid
params_grid = {
    'Decision Tree': {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]},
    'Random Forest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]},
    'XGBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
}
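As a small illustration of how GridSearchCV walks one of these grids, exhaustively cross-validating every parameter combination — on synthetic data, not the sensor readings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the sensor data
Xs, ys = make_classification(n_samples=200, n_features=10, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]},  # same grid as above
    cv=5, n_jobs=-1,
)
grid.fit(Xs, ys)          # fits all 9 combinations x 5 folds
print(grid.best_params_)  # best combination by mean CV accuracy
```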
# Store results
results = {}
# Evaluate models
for name, model in models.items():
    print(f"\nEvaluating {name}")
    if name in ('Neural Network', 'XGBoost'):
        # These two handle their own training (with early stopping)
        # inside evaluate_model rather than going through GridSearchCV
        results[name] = evaluate_model(model, X_train_scaled, X_test_scaled, y_train, y_test, name)
    else:
        # Scikit-learn models with GridSearchCV
        grid = GridSearchCV(model, params_grid[name], cv=5, n_jobs=-1)
        grid.fit(X_train_scaled, y_train)
        best_model = grid.best_estimator_
        results[name] = evaluate_model(best_model, X_train_scaled, X_test_scaled, y_train, y_test, name)
        print(f"Best parameters for {name}: {grid.best_params_}")
Evaluating Neural Network
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

Evaluating Decision Tree
Best parameters for Decision Tree: {'max_depth': 5, 'min_samples_split': 10}

Evaluating Random Forest
Best parameters for Random Forest: {'max_depth': None, 'n_estimators': 100}

Evaluating XGBoost
# Print detailed results
for name, result in results.items():
    print(f"\n{name} Results:")
    print(f"Accuracy: {result['accuracy']}")
    print("Classification Report:")
    print(pd.DataFrame(result['report']).transpose())
Neural Network Results:
Accuracy: 0.873015873015873
Classification Report:
precision recall f1-score support
0 0.882353 0.882353 0.882353 102.000000
1 0.862069 0.862069 0.862069 87.000000
accuracy 0.873016 0.873016 0.873016 0.873016
macro avg 0.872211 0.872211 0.872211 189.000000
weighted avg 0.873016 0.873016 0.873016 189.000000
Decision Tree Results:
Accuracy: 0.8624338624338624
Classification Report:
precision recall f1-score support
0 0.880000 0.862745 0.871287 102.000000
1 0.842697 0.862069 0.852273 87.000000
accuracy 0.862434 0.862434 0.862434 0.862434
macro avg 0.861348 0.862407 0.861780 189.000000
weighted avg 0.862829 0.862434 0.862534 189.000000
Random Forest Results:
Accuracy: 0.8783068783068783
Classification Report:
precision recall f1-score support
0 0.891089 0.882353 0.886700 102.000000
1 0.863636 0.873563 0.868571 87.000000
accuracy 0.878307 0.878307 0.878307 0.878307
macro avg 0.877363 0.877958 0.877635 189.000000
weighted avg 0.878452 0.878307 0.878355 189.000000
XGBoost Results:
Accuracy: 0.8465608465608465
Classification Report:
precision recall f1-score support
0 0.868687 0.843137 0.855721 102.000000
1 0.822222 0.850575 0.836158 87.000000
accuracy 0.846561 0.846561 0.846561 0.846561
macro avg 0.845455 0.846856 0.845940 189.000000
weighted avg 0.847298 0.846561 0.846716 189.000000
# Plotting ROC Curves
fig = go.Figure()
for name, result in results.items():
    fig.add_trace(go.Scatter(
        x=result['fpr'],
        y=result['tpr'],
        mode='lines',
        name=f'{name} (AUC = {result["roc_auc"]:.2f})',
        line=dict(width=2)
    ))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(dash='dash', color='gray'), name='Random'))
fig.update_layout(
    title_text='Receiver Operating Characteristic (ROC) Curve',
    title_font=dict(size=25),
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    template='plotly_dark'
)
fig.show()
Receiver Operating Characteristic (ROC) Curve¶
Interpretation¶
The ROC curves displayed above show the performance of all four classifiers on the test dataset. Each curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
Key Points:¶
- True Positive Rate (TPR): Also known as Sensitivity or Recall, it is the ratio of correctly predicted positive observations to the actual positives.
- False Positive Rate (FPR): It is the ratio of incorrectly predicted positive observations to the actual negatives.
Analysis:¶
- A perfect classifier would have an AUC of 1.0, while a classifier with no discriminative power would have an AUC of 0.5 (represented by the dashed line labeled "Random").
- All four curves sit close to the top-left corner, demonstrating that the models achieve a high TPR at a low FPR, meaning they correctly identify a large proportion of positive cases while keeping false positives to a minimum. Random Forest posts the highest test accuracy (87.8%), with the Neural Network close behind (87.3%).
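The two AUC extremes described above can be checked directly with `roc_curve` and `auc`. A tiny sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])

# Perfect classifier: scores fully separate the classes -> AUC = 1.0
fpr, tpr, _ = roc_curve(y_true, [0.1, 0.2, 0.8, 0.9])
assert auc(fpr, tpr) == 1.0

# No discriminative power: identical scores -> AUC = 0.5 (the dashed diagonal)
fpr, tpr, _ = roc_curve(y_true, [0.5, 0.5, 0.5, 0.5])
assert auc(fpr, tpr) == 0.5
```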